DEEP BOLTZMANN MACHINES IN ESTIMATION OF
DISTRIBUTION ALGORITHMS FOR COMBINATORIAL
OPTIMIZATION

MALTE PROBST AND FRANZ ROTHLAUF 


Abstract. Estimation of Distribution Algorithms (EDAs) require flexible
probability models that can be efficiently learned and sampled. Deep
Boltzmann Machines (DBMs) are generative neural networks with these desired
properties. We integrate a DBM into an EDA and evaluate the performance
of this system in solving combinatorial optimization problems with a single
objective. We compare the results to the Bayesian Optimization Algorithm (BOA).
The performance of DBM-EDA was superior to BOA for difficult additively
decomposable functions which are separable, i.e., concatenated deceptive traps
of higher order. For most other benchmark problems, DBM-EDA cannot clearly
outperform BOA, or other neural network-based EDAs. In particular, it often
yields optimal solutions for a subset of the runs (with fewer evaluations
than BOA), but is unable to provide reliable convergence to the global
optimum competitively. At the same time, the model building process is fast, but
computationally more expensive than that of other EDAs using probabilistic
models from the neural network family, such as DAE-EDA.


1. Introduction 

Estimation of Distribution Algorithms (EDAs) [HJIH] are metaheuristics for
combinatorial and continuous non-linear optimization. They maintain a population
of candidate solutions to the optimization problem at hand (see Algorithm 1).
First, they select solutions with a high quality from the population.
Subsequently, they build a model that approximates the probability distribution
of these solutions. Then, new candidate solutions are sampled from the model.
The EDA then starts over by selecting the next set of good solutions from the
new candidate solutions and the previous selection.

In order to be suitable for an EDA, a model therefore has to fulfill certain criteria: 

• It must be able to approximate the probability distribution of the selected 
individuals. 

• It must be able to sample new solutions from this probability distribution, 
serving as candidate solutions for the next EDA generation. 

• Both learning and sampling should be efficient. That is, the computational
time required to train and sample the model should be tractable both in the
number of variables and in the number of training examples.
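The loop of Algorithm 1 and the three criteria above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: `UnivariateModel` is a hypothetical stand-in for the probability model (it only estimates independent bit frequencies), and the names `eda`, `fit`, `sample` as well as the use of truncation selection are assumptions made for illustration.

```python
import random


class UnivariateModel:
    """Hypothetical stand-in model: independent per-bit frequencies."""

    def fit(self, parents):
        n = len(parents[0])
        # Estimate P(x_i = 1) from the selected solutions; clamp the value
        # so that neither bit value becomes impossible to sample.
        self.p = [min(0.95, max(0.05, sum(x[i] for x in parents) / len(parents)))
                  for i in range(n)]

    def sample(self, count, rng):
        return [[1 if rng.random() < p_i else 0 for p_i in self.p]
                for _ in range(count)]


def eda(fitness, n_bits, pop_size, generations, model, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half (truncation selection, an assumption).
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        model.fit(parents)                                # build the model
        candidates = model.sample(pop_size - len(parents), rng)  # sample
        population = parents + candidates                 # parents plus candidates
    return max(population, key=fitness)


# Usage: maximize the number of ones in a 20-bit string.
best = eda(fitness=sum, n_bits=20, pop_size=100, generations=30,
           model=UnivariateModel())
```

Any model that offers the same `fit` and `sample` operations, such as a DBM, could be plugged into this loop in place of the univariate stand-in.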

Previous work has shown that generative neural networks can lead to competitive
performance. [52] use a Restricted Boltzmann Machine (RBM) in an EDA and show
that RBM-EDA can be competitive with state-of-the-art EDAs,
especially in terms of computational complexity (CPU time). [20] use another
type of neural network, a Denoising Autoencoder (DAE), in an EDA. DAE-EDA
achieves superior performance when used on problems which can be decomposed
into independent subproblems. In general, neural-network-inspired probabilistic
models can often be parallelized on massively parallel systems such as graphics
processing units (GPUs) [21].

In this paper, we focus on Deep Boltzmann Machines (DBMs) [23]. DBMs are
deep models in the sense that they use multiple layers $\mathbf{h}^j$,
$j = 1 \ldots d$, of hidden (latent) neurons.^1 A DBM models the joint
probability distribution $P(\mathbf{v}, \mathbf{h}^1, \ldots, \mathbf{h}^d)$
of the training data $\mathbf{v}$ and the hidden neurons.

Deep architectures are particularly interesting, because they are able to model
problems on multiple layers of abstraction. A deep model is usually composed of
multiple layers of computational units (e.g., neurons). The concepts modeled by
each layer become more abstract with the layer's depth mm-. An intuitive example
is a deep neural network that learns to model images of faces [10]: Neurons on the
first hidden layer learn to model individual edges and other shapes. Units on deeper
layers compose these edges to form higher-level features, like noses or eyes. Again,
by combining these mid-level representations, neurons in the deepest layers can
compose complete faces. Many real-world problems like image classification possess
this kind of hierarchical structure with various layers of abstraction. Deep models
have recently gained much attention, as they were able to yield superior results for
various real-world problem domains [9].

We implement a DBM and use it within an EDA to solve combinatorial
optimization problems. We test DBM-EDA on the simple onemax problem, concatenated
deceptive trap functions, NK landscapes and the HIFF function. We compare the
results to the state-of-the-art multivariate Bayesian Optimization Algorithm (BOA,
see [laiis]), and publish the source code of all experiments.^2

Section 2 introduces DBMs. Section 3 describes benchmark problems,
experimental setup, and presents the results. We discuss the results and conclude
the paper in Section 4.


2. Deep Boltzmann Machines 

DBMs are special types of Boltzmann Machines, one of the fundamental types
of neural networks [1][23]. A DBM has a visible layer
$\mathbf{v} \in \{0,1\}^n$ and $d$ hidden

^1 We use the following notation: $x$ denotes a scalar value, $\mathbf{x}$
denotes a vector of scalars, $\mathbf{X}$ denotes a matrix of scalars.

^2 See https://github.com/wohnjayne/eda-suite/ for the complete source code.


Algorithm 1 Estimation of Distribution Algorithm

Initialize population $P$
while not converged do
    $P_{parents}$ ← Select high-quality solutions from $P$ based on their fitness
    $M$ ← Build a model estimating the (joint) probability distribution of $P_{parents}$
    $P_{candidates}$ ← Sample new candidate solutions from $M$
    $P$ ← $P_{parents} \cup P_{candidates}$
end while
Figure 1. A Deep Boltzmann Machine with two hidden layers $\mathbf{h}^1$,
$\mathbf{h}^2$, as a graph. The visible neurons $v_i$ ($i \in 1 \ldots n$) can
hold a data vector of length $n$ from the training data. In the EDA context,
$\mathbf{v}$ represents decision variables. The hidden neurons $h^1_j$
($j \in 1 \ldots m_1$) and $h^2_k$ ($k \in 1 \ldots m_2$) represent $m_1$
first-level and $m_2$ second-level features, respectively.


layers $\mathbf{h}^j \in \{0,1\}^{m_j}$. Each layer consists of binary units, and
learns a non-linear representation of the data on the layer below. Hence, upper
layers will learn more abstract concepts.

Here, we focus on DBMs with two hidden layers $\mathbf{h}^1$ and $\mathbf{h}^2$
(see Figure 1). The layers of neurons are connected by symmetric weights. The
weight matrix $\mathbf{W}^1$ of size $n \times m_1$ connects the visible layer to
the first hidden layer. Weight $w^1_{ij}$ therefore connects $v_i$ to $h^1_j$.
Accordingly, weight matrix $\mathbf{W}^2$ of size $m_1 \times m_2$ connects hidden
layer $\mathbf{h}^1$ to hidden layer $\mathbf{h}^2$. There are no connections
within the layers.

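From Boltzmann Machines, DBMs inherit the concept of a scalar energy
associated with each configuration of their neurons. The energy of the state
$\{\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2\}$ is defined as

(1) $E(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2; \theta) = -\mathbf{v}^\top \mathbf{W}^1 \mathbf{h}^1 - (\mathbf{h}^1)^\top \mathbf{W}^2 \mathbf{h}^2$,

where $\theta = \{\mathbf{W}^1, \mathbf{W}^2\}$ are the model's parameters.^3 The
probability of a particular configuration of the visible neurons $\mathbf{v}$
under the model is

(2) $P(\mathbf{v}; \theta) = \frac{1}{Z(\theta)} \sum_{\mathbf{h}^1, \mathbf{h}^2} \exp(-E(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2; \theta))$.

$Z(\theta)$ is the partition function which normalizes the probability by summing
over all possible configurations.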
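To illustrate why the partition function is the expensive part, the following sketch evaluates the energy and the exact probability of a visible configuration by brute-force enumeration. It assumes the usual bias-free two-layer DBM energy $-\mathbf{v}^\top \mathbf{W}^1 \mathbf{h}^1 - (\mathbf{h}^1)^\top \mathbf{W}^2 \mathbf{h}^2$; the function names and weight values are illustrative assumptions, and the enumeration is only feasible for very small models because the number of states grows as $2^{n + m_1 + m_2}$.

```python
import itertools
import math


def energy(v, h1, h2, W1, W2):
    """Bias-free DBM energy: -v^T W1 h1 - h1^T W2 h2."""
    e = 0.0
    for i, v_i in enumerate(v):
        for j, h1_j in enumerate(h1):
            e -= v_i * W1[i][j] * h1_j
    for j, h1_j in enumerate(h1):
        for k, h2_k in enumerate(h2):
            e -= h1_j * W2[j][k] * h2_k
    return e


def binary_states(length):
    """All binary tuples of the given length."""
    return itertools.product([0, 1], repeat=length)


def prob_visible(v, W1, W2):
    """Exact P(v) by enumerating every state of the model."""
    n, m1, m2 = len(W1), len(W2), len(W2[0])

    def unnormalized(vis):
        # Sum exp(-E) over all configurations of both hidden layers.
        return sum(math.exp(-energy(vis, h1, h2, W1, W2))
                   for h1 in binary_states(m1)
                   for h2 in binary_states(m2))

    Z = sum(unnormalized(vis) for vis in binary_states(n))  # partition function
    return unnormalized(v) / Z


# Illustrative weights: n = 3 visible, m1 = 2 and m2 = 1 hidden neurons.
W1 = [[2.0, -1.0], [2.0, 0.5], [-1.0, 1.5]]
W2 = [[1.0], [-0.5]]
total = sum(prob_visible(v, W1, W2) for v in binary_states(3))
```

Summing `prob_visible` over all visible configurations gives one (up to rounding), which is a quick sanity check of the normalization.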


^3 We omit the bias terms for brevity, see [24].
The conditional probabilities of the neurons, given the activations of their
neighboring layers, are easier to calculate. Each neuron $v_i$ calculates its
conditional probability to be active as

(3) $P(v_i = 1 \mid \mathbf{h}^1) = \mathrm{sigm}\left(\sum_j h^1_j w^1_{ij}\right)$,

with $\mathrm{sigm}(x)$ being the logistic function
$\mathrm{sigm}(x) = \frac{1}{1 + e^{-x}}$. The neurons in hidden layer
$\mathbf{h}^1$ calculate their probability to be active as

(4) $P(h^1_j = 1 \mid \mathbf{v}, \mathbf{h}^2) = \mathrm{sigm}\left(\sum_i v_i w^1_{ij} + \sum_k h^2_k w^2_{jk}\right)$,

and the neurons in hidden layer $\mathbf{h}^2$ calculate their probability to be
active as

(5) $P(h^2_k = 1 \mid \mathbf{h}^1) = \mathrm{sigm}\left(\sum_j h^1_j w^2_{jk}\right)$.
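The three conditionals (3) to (5) translate directly into code. The following is a minimal sketch in plain Python with bias terms omitted; the function names and the example weights are assumptions chosen for illustration.

```python
import math


def sigm(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def p_v_given_h1(h1, W1):
    # Equation (3): each visible neuron sums the input from h1.
    return [sigm(sum(h1[j] * W1[i][j] for j in range(len(h1))))
            for i in range(len(W1))]


def p_h1_given_v_h2(v, h2, W1, W2):
    # Equation (4): h1 receives input from the layer below and the layer above.
    return [sigm(sum(v[i] * W1[i][j] for i in range(len(v))) +
                 sum(h2[k] * W2[j][k] for k in range(len(h2))))
            for j in range(len(W2))]


def p_h2_given_h1(h1, W2):
    # Equation (5): h2 only receives input from h1.
    return [sigm(sum(h1[j] * W2[j][k] for j in range(len(h1))))
            for k in range(len(W2[0]))]


# Illustrative weights: n = 3, m1 = 2, m2 = 1.
W1 = [[2.0, -1.0], [2.0, 0.5], [-1.0, 1.5]]
W2 = [[1.0], [-0.5]]
p = p_h1_given_v_h2([1, 1, 0], [1], W1, W2)
```

Because there are no connections within a layer, every neuron of a layer can be evaluated independently, which is what makes these updates easy to parallelize.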


Algorithm 2 Pseudo code for pre-training a DBM with RBMs

Initialize $\mathbf{W}^1$, $\mathbf{W}^2$ to small, random values
$\mathbf{W}^1$ ← Train RBM_1 modeling $P(\mathbf{v}, \mathbf{h}^1; \mathbf{W}^1)$,
    using $P_{parents}$ as training set
$\mathbf{W}^2$ ← Train RBM_2 modeling $P(\mathbf{h}^1, \mathbf{h}^2; \mathbf{W}^2)$,
    using samples from $P(\mathbf{h}^1 \mid P_{parents}; \mathbf{W}^1)$ as defined by RBM_1 as training set


2.1. Training a DBM. In the training phase, the parameters $\theta$ of the DBM have
to be adjusted such that the model approximates the probability distribution of
the training data. This could, in principle, be done by using the general training
procedure of Boltzmann Machines [1]. However, this would be computationally
intractable. Hence, a greedy, layer-wise pretraining is used to initialize the
parameters to some sensible values, and parameter fine-tuning is subsequently
performed on the complete DBM.

During the pretraining phase, we consider the DBM to be a stack of Restricted
Boltzmann Machines (RBMs) (see Algorithm 2). RBMs are similar to DBMs, but
only have a single layer of hidden neurons. Contrastive divergence is a tractable
learning algorithm to train RBMs [4]. Specifically, we consider $\mathbf{v}$ and
$\mathbf{h}^1$ to be an RBM, with $\mathbf{W}^1$ as its parameters. We then train
the resulting RBM to model the probability distribution of the training data,
using contrastive divergence (see e.g. [HIS] for details). Subsequently, we train
a second RBM, consisting of $\mathbf{h}^1$ and $\mathbf{h}^2$.
The training data for the second RBM consists of the activations of the first RBM's
hidden layer on the original training data.^4
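The greedy, layer-wise pretraining can be sketched as follows. This is a simplified illustration and not the authors' code: it uses one-step contrastive divergence (CD-1) without bias terms, momentum or weight decay, and it leaves out the constant scaling of the weight matrices mentioned in the footnote. The function names, learning rate and epoch count are assumptions.

```python
import math
import random


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W):
    return [sigm(sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(W[0]))]


def visible_probs(h, W):
    return [sigm(sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(W))]


def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_epoch(data, W, lr, rng):
    """One pass of CD-1 over the data; updates W in place."""
    for v0 in data:
        ph0 = hidden_probs(v0, W)                  # positive statistics
        h0 = bernoulli(ph0, rng)
        v1 = bernoulli(visible_probs(h0, W), rng)  # one Gibbs step
        ph1 = hidden_probs(v1, W)                  # negative statistics
        for i in range(len(W)):
            for j in range(len(W[0])):
                W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])


def pretrain(data, m1, m2, epochs=50, lr=0.1, seed=0):
    rng = random.Random(seed)
    n = len(data[0])
    W1 = [[rng.gauss(0, 0.01) for _ in range(m1)] for _ in range(n)]
    W2 = [[rng.gauss(0, 0.01) for _ in range(m2)] for _ in range(m1)]
    for _ in range(epochs):                        # first RBM: v and h1
        cd1_epoch(data, W1, lr, rng)
    # The second RBM is trained on sampled activations of the first hidden layer.
    h1_data = [bernoulli(hidden_probs(v, W1), rng) for v in data]
    for _ in range(epochs):                        # second RBM: h1 and h2
        cd1_epoch(h1_data, W2, lr, rng)
    return W1, W2


W1, W2 = pretrain([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, m1=3, m2=2)
```

The returned matrices would then serve as the starting point for fine-tuning the complete DBM.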

Once both RBMs have been trained, we use their parameters $\mathbf{W}^1$ and
$\mathbf{W}^2$ as an initialization for the parameters $\theta$ of a single DBM.
Note that the pretraining does not adjust the weights such that the DBM's
conditional probability distribution
$P(\mathbf{v} \mid \mathbf{h}^1, \mathbf{h}^2; \theta)$ approximates the
probability distribution of the training data.

In order to achieve this, we fine-tune the DBM (see Algorithm 3). This can
be done using a gradient descent algorithm to modify $\theta$. The general idea of the

^4 There is a small adjustment in the training procedure for the RBMs. In practice, the weight
matrices $\mathbf{W}^1$ and $\mathbf{W}^2$ are multiplied by constant factors, see m for details.
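Algorithm 3 Pseudo code for fine-tuning a DBM

Set $0 < \alpha < 1$, e.g. $\alpha = 0.1$
Initialize DBM's parameters $\theta$ with pre-trained $\mathbf{W}^1$ and $\mathbf{W}^2$
Initialize fantasy particles $\tilde{\mathbf{v}}$ and $\tilde{\mathbf{h}}^2$ randomly
while not converged do
    for each example in the training set do
        -- Positive phase --
        $\mathbf{v}$ ← set $\mathbf{v}$ to the current training example
        $\mathbf{h}^1$ ← set to $P(\mathbf{h}^1 \mid \mathbf{v})$,
            ignoring input from $\mathbf{h}^2$
        $\mathbf{h}^2$ ← set to $P(\mathbf{h}^2 \mid \mathbf{h}^1)$
        run mean field approximation:
        for a small number of iterations (e.g. ten) do
            $\mathbf{h}^1$ ← set to $P(\mathbf{h}^1 \mid \mathbf{v}, \mathbf{h}^2)$
            $\mathbf{h}^2$ ← set to $P(\mathbf{h}^2 \mid \mathbf{h}^1)$
        end for
        -- Negative phase --
        for a small number of iterations (e.g. five) do
            $\tilde{\mathbf{h}}^1$ ← activate stochastically with $P(\tilde{\mathbf{h}}^1 \mid \tilde{\mathbf{v}}, \tilde{\mathbf{h}}^2)$
            $\tilde{\mathbf{h}}^2$ ← activate stochastically with $P(\tilde{\mathbf{h}}^2 \mid \tilde{\mathbf{h}}^1)$
            $\tilde{\mathbf{v}}$ ← activate stochastically with $P(\tilde{\mathbf{v}} \mid \tilde{\mathbf{h}}^1)$
        end for
        -- Calculate and apply gradient --
        $\delta_{pos} = \frac{\partial E(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2; \theta)}{\partial \theta}$
        $\delta_{neg} = -\frac{\partial E(\tilde{\mathbf{v}}, \tilde{\mathbf{h}}^1, \tilde{\mathbf{h}}^2; \theta)}{\partial \theta}$
        $\theta := \theta - \alpha (\delta_{pos} + \delta_{neg})$
    end for
end while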


algorithm is as follows: In order to increase the probability of a training data
point $\mathbf{v}$ under the model, the energy
$E(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2; \theta)$ of this configuration has to
be decreased (see Equation 2). At the same time, the probability of all other
configurations has to be lowered, by increasing their energies contained in the
partition function $Z$.

The gradient hence contains two terms. The first term (positive gradient)
increases the probability of the current sample under the model, the second term
(negative gradient) decreases the probabilities of all other configurations in $Z$.
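For the weights, the two terms can be written out explicitly. The following derivation is a sketch under the assumption of the usual bias-free two-layer energy $E = -\mathbf{v}^\top \mathbf{W}^1 \mathbf{h}^1 - (\mathbf{h}^1)^\top \mathbf{W}^2 \mathbf{h}^2$; the angle-bracket notation for expectations (with $\mathbf{v}$ clamped to the training example for "data", and under the model distribution for "model") is introduced here for illustration only.

```latex
\begin{align*}
\frac{\partial E}{\partial w^1_{ij}} &= -v_i h^1_j, &
\frac{\partial E}{\partial w^2_{jk}} &= -h^1_j h^2_k, \\
\frac{\partial \log P(\mathbf{v};\theta)}{\partial w^1_{ij}}
  &= \langle v_i h^1_j \rangle_{\text{data}}
   - \langle v_i h^1_j \rangle_{\text{model}}, &
\frac{\partial \log P(\mathbf{v};\theta)}{\partial w^2_{jk}}
  &= \langle h^1_j h^2_k \rangle_{\text{data}}
   - \langle h^1_j h^2_k \rangle_{\text{model}}.
\end{align*}
```

The first expectation in each difference corresponds to the positive gradient, the second to the negative gradient. Neither can be computed exactly for models of realistic size, which is why both are approximated.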

The negative gradient is calculated by running a separate Markov chain of
fantasy particles in the model. Specifically, all neurons are first initialized
randomly. Then, the following two steps are repeated: First, $\mathbf{h}^1$ is
activated stochastically, with probabilities
$P(\mathbf{h}^1 \mid \mathbf{v}, \mathbf{h}^2)$ as in Equation 4. Then,
$\mathbf{v}$ and $\mathbf{h}^2$ are activated stochastically, with
$P(\mathbf{v} \mid \mathbf{h}^1)$ as in Equation 3 and
$P(\mathbf{h}^2 \mid \mathbf{h}^1)$ as in Equation 5. If
the parameter updates are small enough, and the Markov chain is allowed to run a
couple of steps between each update (e.g. five steps), the samples will come from
the chain's equilibrium distribution. The current state of the fantasy particle is
then used to calculate the negative gradient.

The positive term could be approximated by running the same Markov chain as
above, but with $\mathbf{v}$ clamped to the current training example. However,
this would result in running the Markov chain for many steps, for each training
example. Instead, the positive gradient can be approximated using a mean-field
approach. First, neurons in $\mathbf{v}$ are set according to the current training
example. Then, $P(\mathbf{h}^1)$ and, subsequently,
$P(\mathbf{h}^2 \mid P(\mathbf{h}^1))$ are calculated, as in Equations 4 and 5.
Then, for a small number of steps (e.g. ten steps),
$P(\mathbf{h}^1 \mid \mathbf{v}, P(\mathbf{h}^2))$ and
$P(\mathbf{h}^2 \mid P(\mathbf{h}^1))$ are calculated repeatedly. The resulting
configuration of $\mathbf{v}$, $P(\mathbf{h}^1)$ and $P(\mathbf{h}^2)$ is treated
as a positive example, and used for the calculation of the positive gradient. The
DBM's parameters $\theta$ are then updated in the direction of the total gradient.
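The mean-field procedure for the positive phase can be sketched as below. The code is an illustration under the assumptions of a bias-free model and ten update steps; `mean_field` and the example weights are hypothetical names and values. Instead of binary samples, the hidden layers hold real-valued activation probabilities that are updated in turn.

```python
import math


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_field(v, W1, W2, steps=10):
    """Return mean-field activation probabilities (mu1, mu2) for a clamped v."""
    n, m1, m2 = len(W1), len(W2), len(W2[0])
    # Bottom-up input to h1 stays fixed because v is clamped.
    bottom_up = [sum(v[i] * W1[i][j] for i in range(n)) for j in range(m1)]
    # Initial pass: h1 ignores input from h2, then h2 is computed from h1.
    mu1 = [sigm(b) for b in bottom_up]
    mu2 = [sigm(sum(mu1[j] * W2[j][k] for j in range(m1))) for k in range(m2)]
    for _ in range(steps):
        # h1 now also receives top-down input from the current estimate of h2.
        mu1 = [sigm(bottom_up[j] + sum(mu2[k] * W2[j][k] for k in range(m2)))
               for j in range(m1)]
        mu2 = [sigm(sum(mu1[j] * W2[j][k] for j in range(m1)))
               for k in range(m2)]
    return mu1, mu2


# Illustrative weights: n = 3, m1 = 2, m2 = 1.
W1 = [[2.0, -1.0], [2.0, 0.5], [-1.0, 1.5]]
W2 = [[1.0], [-0.5]]
mu1, mu2 = mean_field([1, 1, 0], W1, W2)
```

The resulting values `mu1` and `mu2`, together with the clamped `v`, would form the positive example from which the positive gradient is computed.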

For a more detailed description of the training algorithm, see [53] . 

2.2. Sampling a DBM. The DBM can be sampled by initializing all neurons
to random values, and running the same sampling chain used to retrieve fantasy
particles for the negative gradient. In the case of DBM-EDA, we initialize
$\mathbf{v}$ with the solutions in $P_{parents}$, and run the chain for 25
iterations.
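A sketch of this sampling procedure is given below. It assumes a bias-free DBM and one sampled candidate per parent; the random initialization of the second hidden layer, the function name `sample_dbm` and the example weights are assumptions made for illustration.

```python
import math
import random


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def sample_dbm(parents, W1, W2, steps=25, seed=0):
    """Run one Gibbs chain per parent and return the final visible states."""
    rng = random.Random(seed)
    n, m1, m2 = len(W1), len(W2), len(W2[0])
    samples = []
    for parent in parents:
        v = list(parent)                              # chain starts at a parent
        h2 = [rng.randint(0, 1) for _ in range(m2)]   # assumption: random start
        for _ in range(steps):
            h1 = bernoulli([sigm(sum(v[i] * W1[i][j] for i in range(n)) +
                                 sum(h2[k] * W2[j][k] for k in range(m2)))
                            for j in range(m1)], rng)
            h2 = bernoulli([sigm(sum(h1[j] * W2[j][k] for j in range(m1)))
                            for k in range(m2)], rng)
            v = bernoulli([sigm(sum(h1[j] * W1[i][j] for j in range(m1)))
                           for i in range(n)], rng)
        samples.append(v)
    return samples


# Illustrative weights: n = 4, m1 = 2, m2 = 1.
W1 = [[4.0, -4.0], [4.0, -4.0], [-4.0, 4.0], [-4.0, 4.0]]
W2 = [[2.0], [-2.0]]
candidates = sample_dbm([[1, 1, 0, 0], [0, 0, 1, 1]], W1, W2)
```

Starting the chain at selected solutions rather than at random states means that the chain begins in a region the model was trained on.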

3. Experiments 

We evaluate DBM-EDA using a set of standard benchmark problems. We
compare the results of DBM-EDA to those of the state-of-the-art multivariate BOA.

3.1. Test Problems. We evaluate DBM-EDA on onemax, concatenated deceptive 
traps, NK landscapes and the HIFF function. All four are standard benchmark 
problems. Their difficulty depends on the problem size, i.e., problems with more 
decision variables are more difficult. Furthermore, the difficulty of concatenated 
deceptive trap functions and NK landscapes is tunable by a parameter. Apart from 
the simple onemax problem, all problems are composed of subproblems, which are 
either deceptive (traps), overlapping (NK landscapes), or hierarchical (HIFF), and 
therefore multimodal. 

The onemax problem assigns a binary solution $\mathbf{x}$ of length $l$ a fitness
value $f = \sum_{i=1}^{l} x_i$, i.e., the fitness of $\mathbf{x}$ is equal to the
number of ones in $\mathbf{x}$. The onemax function is rather simple. It is
unimodal and can be solved by a deterministic hill climber.
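As a small illustration of this claim, the sketch below implements onemax together with a simple deterministic bit-flip hill climber. The function names are chosen for illustration and the climber is one possible variant, not necessarily the one the authors have in mind.

```python
def onemax(x):
    """Fitness is the number of ones in the bit string."""
    return sum(x)


def hill_climb(x, fitness):
    """Flip single bits in index order as long as this improves the fitness."""
    x = list(x)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            y = x[:i] + [1 - x[i]] + x[i + 1:]
            if fitness(y) > fitness(x):
                x, improved = y, True
    return x


result = hill_climb([0, 1, 0, 0, 1], onemax)
```

Since every flip from zero to one increases the fitness, the climber reaches the string of all ones from any starting point.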

Concatenated deceptive traps are tunably hard, yet separable test problems [3].
Here, a solution vector $\mathbf{x}$ is divided into $l$ subsets of size $k$, with
each one being a deceptive trap. Within a trap, all bits are dependent on each
other but independent of all other bits in $\mathbf{x}$. Thus, the fitness
contribution of the traps can be evaluated separately and the total fitness of the
solution vector is the sum of these terms. In particular, the assignment
$\mathbf{a} = \mathbf{x}_{i:i+k-1}$ (i.e., the $k$ bits from $x_i$ to
$x_{i+k-1}$)^5 leads to a fitness contribution $F_i$ as

$F_i(\mathbf{a}) = \begin{cases} k & \text{if } \sum_j a_j = k, \\ k - \left(\sum_j a_j + 1\right) & \text{otherwise.} \end{cases}$

In other words, the fitness of a single trap increases with the number of zeros, except
for the optimum of all ones.
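The following sketch shows the trap function in code. It assumes the common definition in which a trap of size $k$ with $u$ ones yields fitness $k$ if $u = k$ and $k - (u + 1)$ otherwise, which agrees with the verbal description above. For simplicity it also assumes that the bits of each trap are adjacent; the function names are illustrative.

```python
def trap(a):
    """Deceptive trap: optimum at all ones, otherwise more zeros are better."""
    k, ones = len(a), sum(a)
    return k if ones == k else k - (ones + 1)


def concatenated_traps(x, k):
    """Sum of the trap values over consecutive blocks of k bits."""
    return sum(trap(x[i:i + k]) for i in range(0, len(x), k))


values = [trap([1] * u + [0] * (5 - u)) for u in range(6)]
```

For $k = 5$, `values` lists the fitness for zero to five ones: it falls from 4 to 0 as ones are added and then jumps to 5 at the optimum. This is the deceptive gradient that leads local search away from the all-ones block.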

NK landscapes are defined by two parameters $n$ and $k$ and $n$ fitness components
$f_i$, $i \in \{1, \ldots, n\}$ [5]. A solution vector $\mathbf{x}$ consists of
$n$ bits. The bits are assigned to $n$ overlapping subsets, each of size $k + 1$.
The fitness of a solution is the sum of $n$

^5 The $k$ variables assigned to trap $i$ do not have to be adjacent, but can be at any position in
$\mathbf{x}$.





fitness components. Each component $f_i$ depends on the value of the corresponding
variable $x_i$ as well as $k$ other variables. Each $f_i$ maps each possible
configuration of its $k + 1$ variables to a fitness value. The overall fitness
function is

$f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} f_i(x_i, x_{i_1}, \ldots, x_{i_k})$.

Each decision variable usually influences several $f_i$. These dependencies between
subsets make NK landscapes non-separable, i.e., in general, we cannot solve the
subproblems independently. The problem difficulty increases with $k$. $k = 0$ is a
special case where all decision variables are independent and the problem reduces
to a unimodal onemax. We use instances of NK landscapes with known optima
from [16].
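A random NK landscape can be generated as sketched below. This is an illustration only: the paper uses fixed instances with known optima from [16], whereas this sketch draws the neighbor sets and the component tables at random. The function name `make_nk` and the seed handling are assumptions.

```python
import itertools
import random


def make_nk(n, k, seed=0):
    """Return a fitness function for a randomly generated NK landscape."""
    rng = random.Random(seed)
    # Each component i depends on x_i and on k other, randomly chosen variables.
    neighbors = [rng.sample([j for j in range(n) if j != i], k)
                 for i in range(n)]
    # Each component maps every configuration of its k + 1 bits to a value.
    tables = [{bits: rng.random()
               for bits in itertools.product([0, 1], repeat=k + 1)}
              for _ in range(n)]

    def fitness(x):
        total = 0.0
        for i in range(n):
            key = (x[i],) + tuple(x[j] for j in neighbors[i])
            total += tables[i][key]
        return total / n

    return fitness


f = make_nk(n=10, k=3)
value = f([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
```

Each table has $2^{k+1}$ entries, so the description of an instance grows exponentially in $k$ but only linearly in $n$.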

The Hierarchical If-and-only-if (HIFF) function [57] is defined for solution
vectors of length $n = 2^l$, where $l \in \mathbb{N}$ is the number of layers of
the hierarchy. It uses a mapping function $M$ and a contribution function $C$, both
of which take two inputs. The mapping function takes each of the $n/2$ blocks of
two neighboring variables of level $l = 1$, and maps them onto a single symbol
each. An assignment of 00 is mapped to 0, 11 is mapped to 1 and everything else is
mapped to the null symbol. The concatenation of $M$'s outputs on level $l$ is used
as $M$'s input for the next level $l + 1$ of the hierarchy, i.e., if level $l = 1$
has $n$ variables, level $l = 2$ has $n/2$ variables. On each level, $C$ assigns a
fitness to each block of two variables. The assignments 00 and 11 are both mapped
to $2^l$, everything else to 0. The total fitness is the sum of all blocks'
contributions on all levels. In other words, a block contributes to the fitness on
the current level if both variables in a block have the same assignment. However,
blocks contribute to the fitness on the next level only if neighboring blocks
agree on the assignment, which is why HIFF is a difficult problem. HIFF has
two global optima, the string of all ones, and the string of all zeros.
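The description above can be turned into a short function. The sketch assumes that a block on level $l$ contributes $2^l$ when both symbols agree and are not the null symbol, and it represents the null symbol as `None`; both the function name and this representation are illustrative choices.

```python
def hiff(x):
    """Hierarchical if-and-only-if fitness for a bit list of length 2**l."""
    level, l, total = list(x), 1, 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            a, b = level[i], level[i + 1]
            if a is not None and a == b:
                total += 2 ** l          # contribution C of an agreeing block
                next_level.append(a)     # mapping M: 00 -> 0, 11 -> 1
            else:
                next_level.append(None)  # mapping M: everything else -> null
        level, l = next_level, l + 1
    return total


best = hiff([1] * 8)
partial = hiff([1, 1, 0, 0, 1, 1, 0, 0])
```

With eight bits, both global optima reach a fitness of 24, whereas the string 11001100 only scores on the lowest level: its blocks are internally consistent but disagree with their neighbors.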

3.2. Experimental Setup. We use several instances of the test problems. For
each instance and algorithm, we test multiple population sizes between 100 and
16,000.^6

We perform 20 runs for each population size. In each run, the EDAs are allowed
to run for up to 150 generations. We terminate a run if there is no improvement
in the best solution for more than 50 generations. These settings make it very
unlikely that a run is terminated prematurely, i.e., before convergence. Both
DBM-EDA and BOA use tournament selection without replacement of size two m-. Note
that all test problems, with the exception of NK landscapes, have the string of all
ones as their global optimum, for any problem size. To avoid any possible
model-induced bias towards solutions with ones or zeros, we generate a random
matrix $\mathbf{R} \in \{0,1\}^{n \times m}$ of ones and zeros for each run. In
each generation, we apply the following operations. Before training, we set
$trainingData = P_{parents} \oplus \mathbf{R}$, with $\oplus$ being a logical XOR.
After sampling, we set $P_{candidates} = modelSamples \oplus \mathbf{R}$.
These operations are transparent to correlations between variables.
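The effect of the XOR transformation can be illustrated with a small example. The sketch simplifies the setup by applying a single random mask row to every solution, which is an assumption made for illustration (the text describes a random matrix); the helper names are hypothetical.

```python
import random


def xor_rows(rows, mask):
    """Apply a bitwise XOR with the mask to every row."""
    return [[bit ^ m for bit, m in zip(row, mask)] for row in rows]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
parents = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
mask = [rng.randint(0, 1) for _ in range(4)]

training_data = xor_rows(parents, mask)   # what the model is trained on
decoded = xor_rows(training_data, mask)   # applying the mask again undoes it
```

Applying the mask twice restores the original solutions, and the Hamming distance between any two solutions is unchanged by the mask. The model therefore sees the same dependency structure, but the all-ones string no longer looks special to it.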

We use standard values for all hyper-parameters governing the DBM’s learning 
and sampling procedures. All hyper-parameters, and further details of the learning 
process such as momentum or weight decay are available in a configuration file 
along with the source code (see git repository on github.com). 


^6 popsize $\in$ {100; 200; 300; 400; 500; 1,000; 1,500; and 2,000 to 16,000 (increment 1,000)}
The algorithms are implemented in Matlab/Octave and executed using Octave
V3.2.4 on a single core of an AMD Opteron 6272 processor with 2,100 MHz.
For the DBM, we used the source code provided by [25].

3.3. Results. Table 1 shows results for DBM-EDA and BOA on the onemax
problem (50, 75, 100, and 150 bit problem size), concatenated 4-Traps (40, 60, and 80
bit), concatenated 5-Traps (25, 50, 75, and 100 bit), NK landscapes with
$k \in \{4, 5\}$ (30 and 34 bit, two instances each) and the HIFF function (64 and
128 bit).

We report the population sizes, the average number of unique fitness evaluations, 
and the average CPU times that each algorithm needed to solve the respective 
problem instance to optimality in at least 50% and 90% of the runs. 
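The selection of the reported population size can be sketched as a simple search over the tested sizes. The helper `minimal_popsize` and the success records in the example are hypothetical and only illustrate the rule; they are not data from the experiments.

```python
def minimal_popsize(successes_by_popsize, min_successes):
    """Smallest population size with at least min_successes successful runs."""
    for popsize in sorted(successes_by_popsize):
        if sum(successes_by_popsize[popsize]) >= min_successes:
            return popsize
    return None  # no tested population size reaches the threshold


# Hypothetical outcome of 20 runs per population size (True = optimum found).
runs = {
    100: [True] * 5 + [False] * 15,
    200: [True] * 12 + [False] * 8,
    300: [True] * 19 + [False] * 1,
}
size_50 = minimal_popsize(runs, min_successes=10)   # at least 50% of 20 runs
size_90 = minimal_popsize(runs, min_successes=18)   # at least 90% of 20 runs
```

A return value of `None` corresponds to the empty table cells, i.e., cases in which no tested population size reached the required success rate.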

First, we concentrate on the number of fitness evaluations, and on the results for
solving at least 50% of the runs (left three result columns of Table 1). For the simple
onemax problem, DBM-EDA needs slightly fewer fitness evaluations than BOA. For
the concatenated deceptive traps, the results are mixed. BOA finds the optimal
solutions to the 4-Trap problems faster than DBM-EDA, while DBM-EDA seems
to be more competitive for the harder 5-Traps problem. For the NK landscapes,
DBM-EDA needs fewer fitness evaluations than BOA in six out of eight instances. On
the HIFF problem, DBM-EDA's performance is clearly inferior to BOA: it needs
many more fitness evaluations on the 64 bit instance, and is unable to find the
optimal solution in at least 50% of runs for the 128 bit instance. This is surprising,
given that HIFF is a hierarchical problem, and the DBM is a hierarchical model.
In theory, the DBM should have been able to find the building blocks for HIFF on
the lower layer of its representation, and recombine them on the higher layers.

We now look at the results for solving at least 90% of runs (right three result
columns of Table 1). With the exception of the small onemax instances, and the
concatenated 5-Traps problem, DBM-EDA's performance is inferior to BOA. In
addition to the larger HIFF instance, DBM-EDA is unable to solve four of the NK
landscape instances. In other words, while DBM-EDA was relatively competitive
when the goal was to solve at least 50% of the runs to optimality, it seems
to be less able to provide reliable convergence to the global optimum. A drastic
example is the 150 bit instance of the simple onemax problem: DBM-EDA needs
a population size of 200 to find the optimum in at least 50% of runs, but 6,000 to
find it in at least 90% of the runs. This behavior has also been observed in another
EDA based on neural networks (DAE-EDA, see [18]).^7

Second, we look at the CPU times required to solve the problem. For most
instances, DBM-EDA is faster than BOA. Note that the direct comparison of CPU
times is not entirely fair for BOA. If both algorithms were implemented in a more
efficient programming language instead of a script-based language like
Matlab/Octave, BOA's speedup would be significantly higher than that of DAE-EDA.
However, neither is Matlab/Octave the best programming language for DBM-EDA:
Almost every recent implementation of neural networks is parallelized on graphics
processing units (GPUs), which, in turn, speeds up training and sampling these
models considerably (see e.g. [251 mile]). Parallelizing multivariate EDAs such as
BOA is well possible; however, the speedups are often single- or double-digit,
even on GPUs (see e.g. [Ml [13]). In contrast, parallelizing
EDAs using neural networks can make proper use of modern GPU hardware and
yield very high speedups: m report speedups of up to 200×, against optimized

^7 Note that this behavior is not so pronounced in later versions of DAE-EDA, which use the
same model, but a different parametrization of the learning phase (publication m in preparation).
Average results: population size such that the optimum is found

                          in ≥50% of runs                            in ≥90% of runs
Problem             Algo  PopSize     Unique Evals       Time (sec)  PopSize     Unique Evals       Time (sec)
--------------------------------------------------------------------------------------------------------------
ONEMAX50            BOA       125        2,119±125          685±101      125        2,119±125          685±101
                    DBM       100         1,700±98           442±81      100         1,700±98           442±81
ONEMAX75            BOA       125        2,787±158        2,182±321      125        2,787±158        2,182±321
                    DBM       100         2,328±86           565±88      100         2,328±86           565±88
ONEMAX100           BOA       250        6,259±153      8,967±1,016      250        6,259±153      8,967±1,016
                    DBM       200        5,592±195        1,641±306      200        5,592±195        1,641±306
ONEMAX150           BOA       250        7,698±270     26,867±3,380      250        7,698±270     26,867±3,380
                    DBM       200        6,095±503        3,555±283    6,000    212,464±6,846     73,024±8,317
4-Traps 40 bit      BOA       500        8,682±429        1,894±323    1,000       13,673±758        2,728±297
                    DBM     2,000     34,561±6,168        2,034±880    3,000     47,231±5,712        2,201±312
4-Traps 60 bit      BOA       500       12,152±518        6,797±927    1,000     20,236±1,362     10,604±1,707
                    DBM     4,000     89,495±4,723        5,481±416    5,000   104,967±12,482        6,793±686
4-Traps 80 bit      BOA     1,000       26,377±780     26,871±3,906    2,000     43,777±1,695     43,935±4,994
                    DBM     6,000    153,278±6,149     13,271±1,295    6,000    153,278±6,149     13,271±1,295
5-Traps 25 bit      BOA     1,000       11,032±877        1,023±245    1,500     14,924±1,028        1,384±211
                    DBM     1,000     10,368±2,125          444±108    1,500     13,291±2,471          566±108
5-Traps 50 bit      BOA     3,000     47,904±3,120     20,199±2,704    3,000     47,904±3,120     20,199±2,704
                    DBM     3,000    51,367±19,948      3,168±2,087    4,000    49,886±11,933        3,060±617
5-Traps 75 bit      BOA     4,000     90,802±2,712     86,908±8,345    6,000    119,044±4,353   119,275±15,826
                    DBM     5,000     99,990±6,169      8,538±1,131    6,000   101,107±19,431      9,183±1,407
5-Traps 100 bit     BOA     6,000    151,231±3,207   284,456±27,058    8,000    190,011±4,664   355,140±25,659
                    DBM     8,000   169,700±12,290     26,802±6,659    8,000   169,700±12,290     26,802±6,659
NK n=30, k=4, i=1   BOA       500        9,820±874        1,364±274    2,000     32,015±3,094      4,590±1,044
                    DBM       300        5,976±941           453±88    5,000     69,081±9,269        2,120±576
NK n=30, k=4, i=2   BOA     2,000     37,883±3,120      6,753±1,551    4,000     67,939±6,649     13,360±3,445
                    DBM     2,000     30,124±3,941        1,508±310    9,000   126,982±13,840        3,529±680
NK n=34, k=4, i=1   BOA       500       11,685±788        1,896±382    1,000     21,546±1,860        3,603±625
                    DBM       400      8,823±1,227          567±105    4,000     69,125±6,476        2,807±631
NK n=34, k=4, i=2   BOA     2,000     41,260±3,544      9,178±1,885    5,000     88,321±9,272     20,377±5,123
                    DBM     4,000     65,752±6,832        2,758±525        -                -                -
NK n=30, k=5, i=1   BOA       250        5,835±867          787±221      500       11,221±644        1,565±260
                    DBM       200        4,284±851          611±290    1,000     17,617±1,601        1,045±178
NK n=30, k=5, i=2   BOA       500     12,122±1,456        1,584±350    2,000     41,641±5,157      6,831±1,541
                    DBM     2,000     33,963±3,124        1,773±247        -                -                -
NK n=34, k=5, i=1   BOA     6,000    130,026±7,306     34,671±4,232   16,000   307,171±24,227    86,846±15,431
                    DBM     6,000   106,392±10,966        4,153±793        -                -                -
NK n=34, k=5, i=2   BOA    13,000   245,051±30,834    63,222±16,462   16,000   300,058±42,914    84,266±22,365
                    DBM    10,000   192,125±18,928      6,752±1,254        -                -                -
HIFF64              BOA       500       11,991±731      7,480±1,059      500       11,991±731      7,480±1,059
                    DBM     3,000     64,073±4,283        7,697±843    3,000     64,073±4,283        7,697±843
HIFF128             BOA     1,000     35,477±1,537    99,782±10,278    1,500     51,008±2,387   137,617±12,151
                    DBM         -                -                -        -                -                -


Table 1. This table shows average results for fitness evaluations
and CPU time for DBM-EDA and BOA for the test problems. For
each instance and algorithm, we selected the minimal population
size which leads to the optimal solution in at least 10 (left three
result columns) or 18 (right three result columns) of 20 runs. Bold
results are significantly smaller, according to a Wilcoxon signed-rank
test (p < 0.01, data is not normally distributed).


CPU code, for RBM-EDA, which uses a neural network model that is closely
related to the DAE. Hence, it is reasonable to assume that an efficient GPU-based
implementation of DBM-EDA will still be fast.

However, other neural network based EDAs such as DAE-EDA are
computationally less expensive. Hence, they need considerably less time for
solving the same benchmark instances to optimality ([Mill HO]).
Figure 2. Visualization of the learned weight matrix of DBM-EDA
optimizing a concatenated 5-trap problem with 30 bits, in a later
EDA generation.


4. Discussion and Conclusion 

We introduced DBM-EDA, an Estimation of Distribution Algorithm which uses a
Deep Boltzmann Machine as its probabilistic model. We evaluated the performance
of DBM-EDA on multiple instances of standard benchmark problems for
combinatorial optimization, and compared the results to the state-of-the-art
multivariate Bayesian Optimization Algorithm.

DBM-EDA was able to solve most of the instances. However, its model quality
was not competitive with BOA, with the exception of concatenated 5-Trap problems.
Specifically, DBM-EDA was often unable to provide reliable convergence to the
global optimum, or needed very large population sizes. Correspondingly, a high
number of fitness evaluations was required. A reason for the insufficient model
quality might be the mean-field approximation of the DBM's training process.
Mean-field approximation struggles if the probability distribution being
approximated is multimodal. However, this is often the case in the early EDA
generations: A sample might resemble configurations of different local optima,
perturbed by noise. Surprisingly, DBM-EDA was unable to solve the larger HIFF
instance at all. This is despite the fact that, as a hierarchical model with
multiple layers, DBM-EDA should be particularly well-suited for a hierarchical
optimization problem like HIFF.

DBM-EDA's performance on the concatenated trap problem with traps of size $k = 5$
was superior to BOA. Here, DBM-EDA was able to solve the larger instances (75
and 100 bit) with fewer fitness evaluations when the optimum had to be found in at
least 90% of the runs. Recall that the problem is particularly
difficult, as it is composed of subproblems which are deceptive. We hypothesize
that the reason for the good performance is the structure of the DBM: The
hidden neurons are conditionally independent (see Equations 4 and 5). This matches
the fitness function, which is additively decomposable, and separable. Figure 2
shows that neurons in the first hidden layer tend to model global optima to
different additive parts of the fitness function. White pixels indicate large positive
weight values, black pixels large negative weight values. Each row visualizes the
connections between a single hidden neuron (of the first hidden layer) and the 30
problem variables. In the deceptive 5-trap problem, blocks of five adjacent variables
have a strong contribution to the fitness, if all five variables are equal to one or
equal to zero. Each block of five variables is independent of all other blocks. The
figure shows that many hidden neurons strongly influence a single block of problem
variables (bright/dark blocks of five adjacent pixels), and are indifferent to most
other neurons (mid gray values). The learned representation of the model therefore
matches the problem structure. As the hidden neurons are activated stochastically,
samples can comprise local optima of different additive terms of the fitness function,
even if this specific combination has not been seen in the training data. A similar
behavior has been observed for DAE-EDA, another EDA using neural networks m-.

In sum, while it is feasible to use a DBM as an EDA model, the effort for learning
the multi-layered DBM model seems not to pay off for the optimization process in
a noisy environment. There are multiple areas for future research. In the case
where only 50% of the runs were required to find the global optimum, the results
for DBM-EDA were quite encouraging. DBM-EDA could be a useful tool, if it
could provide reliable convergence. The reasons why this is currently not the
case should be analyzed properly. Also, more work is necessary to understand why
DBM-EDA was unable to apply the benefits of its hierarchical model to the HIFF
problem.